03 - ggplot2 graph customization

Libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Color scales

Colors in ggplot

Essentially we are going to specify colors in two manners:

  • Fixed name: for example "red", "green", "blue". Only a limited set of names is available (run colors() to list them).
  • Hexadecimal numbers: example "#fff5eb", "#7f2704". Each color has a unique hexadecimal identifier. Use the links above to look for a specific color.

In this course I will mostly use the hexadecimal notation for the colors.
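
As a quick illustration of the two notations, the sketch below lists a few of the names R recognizes and converts a named color to its hexadecimal code with the base R functions colors(), col2rgb() and rgb(). The object names (red_hex, p_hex) are mine and only serve the example.

```r
library(ggplot2)

# A few of the color names that R recognizes
head(colors())

# Convert a named color to its hexadecimal code
red_hex <- rgb(t(col2rgb("red")), maxColorValue = 255)
red_hex

# Either notation can be used wherever ggplot2 expects a fixed color
p_hex <- 
  ggplot(diamonds, aes(x = carat, y = price)) +
    geom_point(color = "#7f2704")

p_hex
```

Note that a fixed color is passed outside aes(), since it does not follow any variable of the dataset.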

Discrete variables - scale_color_brewer() and scale_fill_brewer()

This command makes sense for color scales that follow discrete variables.

  • For figures in which we define the fill aesthetic (e.g. histograms, barplots, heatmaps, box-plots…), we need to use scale_fill_brewer().
  • For figures in which we define the color aesthetic (e.g. scatterplots, lineplots, density graphs…) we need to use scale_color_brewer().

Let us create a couple of graphs that are colored following a discrete variable (clarity). We will adjust their color scales throughout this section:

# Color following clarity, discrete variable
p1 <- 
  ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
     geom_point()

p1

p2 <- 
  ggplot(diamonds, aes(x = color, fill = clarity)) +
    geom_bar()

p2

We will essentially have three options for the step color scales used for these discrete variables:

  • Sequential (18 palettes)
  • Qualitative (8 palettes)
  • Divergent (9 palettes)

Let us explain the logic behind them.

Palettes used by ggplot2 - RColorBrewer

The palettes used by ggplot2 come from the package RColorBrewer. There is no need for us to load this package, but it is useful to know where they come from.

I am going to load the package now just to show the palettes it contains. These display commands can be useful if you want to find specific color combinations.

library(RColorBrewer) # If you have not installed RColorBrewer install it prior
                      # to running this command.

# Display the sequential palettes
display.brewer.all(type="seq")

# Display divergent palettes
display.brewer.all(type="div")

# Display qualitative palettes
display.brewer.all(type="qual")
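
Besides the visual display, RColorBrewer can also report its palettes as data. The sketch below is an optional check, not part of the original workflow: it counts the palettes in each category using the data frame brewer.pal.info and retrieves the hexadecimal codes of one palette with brewer.pal(). The choice of "PuBuGn" with 8 classes is arbitrary.

```r
library(RColorBrewer)

# Number of palettes in each category (div, qual, seq)
palette_counts <- table(brewer.pal.info$category)
palette_counts

# Hexadecimal codes of the 8-class version of one palette
pubugn_8 <- brewer.pal(n = 8, name = "PuBuGn")
pubugn_8
```

The codes returned by brewer.pal() can be pasted into any function that accepts hexadecimal colors.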

Sequential color palettes: type = "seq"

From this link

  • Sequential palettes are suited for ordered data that progresses from low to high. Colorwise, there is a progress from light to dark:
    • Light colors: low values
    • Dark colors: high values

The annotated image below summarizes this and provides the standard sequential palettes available in ggplot:

Sequential color brewer palettes

The colors in our graph use the variable clarity, which is a discrete measure of how clear the diamond is. The values obey a certain quality hierarchy (run ?diamonds and read about the variable):

  • I1 (worst) < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF (best)

Let us pick the sixteenth sequential scale (for example) to color our graphs:

p1 + scale_color_brewer(type = "seq", palette = 16)

p2 + scale_fill_brewer(type = "seq", palette = 16)

Divergent color palettes: type = "div"

From this link

Divergent palettes put equal emphasis on mid-range critical values and extremes at both ends of the data range.

  • Break in the middle of the legend emphasized with light colors
  • Low and high extremes are emphasized with dark colors with contrasting hues

The annotated image below summarizes this and provides the standard divergent palettes available in ggplot. It is followed by a ggplot example:

Divergent palettes in ggplot2

Let us, for example, use the third divergent palette for our figures:

p1 + scale_color_brewer(type = "div", palette = 3)

p2 + scale_fill_brewer(type = "div", palette = 3)

Qualitative color palettes: type = "qual"

From this link

Qualitative palettes do not imply magnitude differences between legend classes and are used to create visual differences between the classes.

Use this when you want to assign distinct colors to each value of a categorical variable without any particular ordering or hierarchy.

The annotated image below summarizes this and provides the standard qualitative palettes available in ggplot. It is followed by a ggplot example:

Qualitative palettes in ggplot

Let us, for example, apply the sixth qualitative palette to our figures:

p1 + scale_color_brewer(type = "qual", palette = 6)

p2 + scale_fill_brewer(type = "qual", palette = 6)

How to choose a built-in palette in R

Option 1: specify palette type and palette number

p1 + scale_color_brewer(type = "seq", palette = 12)

Careful: if you specify an index beyond the number of palettes of that type, you will get an error.

# If you run this, you will get an error
p1 + scale_color_brewer(type = "seq", palette = 19)

Option 2: specify directly the palette name

In this case further specifying the palette type has no effect. The palette name overrides the palette type.

p1 + scale_color_brewer(palette = "PuBuGn")
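
One way to check that the name takes precedence over the type is to compare the colors produced with and without a conflicting type. The sketch below assumes that scale_color_brewer() relies on brewer_pal() from the scales package (installed together with ggplot2); the object names are mine.

```r
library(scales)

# Same palette name, with and without a (conflicting) palette type
with_type    <- brewer_pal(type = "div", palette = "PuBuGn")(8)
without_type <- brewer_pal(palette = "PuBuGn")(8)

# Compare the two sets of hexadecimal codes
same_colors <- identical(with_type, without_type)
same_colors
```

If both calls return the same codes, the type argument played no role once the palette was given by name.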

Manually define your own palettes.

  • For figures in which we define the fill aesthetic (e.g. histograms, barplots, heatmaps, box-plots…), we may use scale_fill_manual() to define our own color palettes specifying all the colors.
  • For figures in which we define the color aesthetic (e.g. scatterplots, lineplots, density graphs…) we may use scale_color_manual() to define our own color palettes specifying all the colors.

Resources such as this link can be used to select palettes that we may specify color by color.

Or you may directly use the colors of your choice.
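
Before the longer exercise, here is a minimal sketch of the idea on the diamonds dataset. The named vector assigns one color to each level of the variable cut. The hexadecimal codes are arbitrary picks of mine, and cut_colors and p_cut are illustrative names.

```r
library(ggplot2)

# Hypothetical palette: one hexadecimal code per level of `cut`
cut_colors <- 
  c("Fair" = "#d7191c", "Good" = "#fdae61", "Very Good" = "#ffffbf",
    "Premium" = "#abdda4", "Ideal" = "#2b83ba")

p_cut <- 
  ggplot(diamonds, aes(x = cut, fill = cut)) +
    geom_bar() +
    # The names of the vector are matched against the levels of `cut`
    scale_fill_manual(values = cut_colors)

p_cut
```

Using a named vector makes the mapping explicit, so the result does not depend on the order in which the colors are listed.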

As an exercise on how to do this, let us manually use a bar plot to print out the rainbow flag. I looked for the hexadecimal codes of the rainbow flag colors here.

Using these color codes (which define a palette of 6 colors), a bar plot and the function scale_fill_manual(), we may generate a rainbow flag such as the one below.

NOTE: this example is included just as an exercise/challenge on how to manually specify color variables and tweak graphs in ggplot2. You will never use ggplot2 for this in real life.

The code to generate the flag above is the following (please read carefully and understand the comments and the effect of every line):

# Named vector with the colors in hex notation
rainbow_flag_c <- 
  c("Life" = "#E40303", "Healing" = "#FF8C00", "Sunlight" = "#FFED00", 
    "Nature" = "#008026", "Serenity" = "#004dff", "Spirit" = "#750787")

# Manually created dataframe
df <- tibble(
        
          # Defined as a factor variable to 
          # ensure proper order of the colors
          # The argument levels must contain the colors
          # in the appropriate order.
          x = factor(names(rainbow_flag_c), 
                     levels = names(rainbow_flag_c)), 
          
          # Constant value so that all the stripes have the same length
          y = c(10, 10, 10, 10, 10, 10)
          )

bars_flag <- 
  
  # Define the canvas for the bar plot
  ggplot(df, aes(x, y, fill = x)) + 
  
  # stat = identity used because we specified y in aes()
  # width = 1 removes the spacing between bars
  geom_bar(stat = "identity",
           width = 1
           ) + 
  
  # Manual color scale for our levels defined with a named vector
  scale_fill_manual(values = rainbow_flag_c) +
  
  # Reverse the x-axis so that colors are printed in the proper order
  # after flipping the axes with coord_flip()
  # I included this command `a posteriori`, when I realized
  # it was necessary after checking the result of coord_flip()
  scale_x_discrete(limits=rev) +
  
  # Rotate coordinates to achieve flag effect
  coord_flip() + 
  
  # Remove  both x and y labels
  xlab("") + ylab("") +
  
  # Introduce theme modifications to set white background,
  # remove axes and remove the legend
  theme(
        panel.background=element_rect(fill='white'), # Use white background
        axis.text.x=element_blank(), # remove text from x axis
        axis.ticks.x=element_blank(), # remove ticks from x axis
        axis.text.y=element_blank(), # remove text from y axis
        axis.ticks.y=element_blank(), # remove ticks from y axis
        legend.position = "none" # remove legend
        ) +
  
  # Print the meaning of each of the colors in each of the bars
  geom_text(
            label = names(rainbow_flag_c), # Sets label within each bar
            color = "white", # Sets white color for the letters
            position = position_stack(vjust = 0.5), # Center position
            size = 10 # Adjust size
            )

bars_flag

Continuous variables - scale_color_gradient() - scale_color_gradient2() - scale_fill_gradient() - scale_fill_gradient2()

  • For figures in which we define the fill aesthetic (e.g. histograms, barplots, heatmaps, box-plots…), we need to use either scale_fill_gradient() or scale_fill_gradient2().
  • For figures in which we define the color aesthetic (e.g. scatterplots, lineplots, density graphs…) we need to use either scale_color_gradient() or scale_color_gradient2().

Variables from low to high - scale_color_gradient() and scale_fill_gradient()

These commands are used for continuous variables that range from a low value to a high value. This is the continuous counterpart of the sequential discrete palettes you may use in scale_color_brewer() (see that section earlier in this notebook).

Let us first create a base graph which is colored following a continuous variable from low to high, in this case the variable price of the diamonds dataset. ggplot automatically uses a gradient scale for the variable.

# Color following price, continuous variable
p3 <- ggplot(diamonds, aes(x = carat, y = price, color = price)) +
     geom_point()

p3

We can change the color gradient used by ggplot by specifying the low and high end of the gradient. For example, if we use the colors with the hexadecimal codes "#E1FA72" and "#F46FEE" (we will see further down how to pick colors) we get:

p3 +
  scale_color_gradient(low = "#E1FA72", high = "#F46FEE")

Or for example:

p3 +
  scale_color_gradient(low = "red", high = "green")
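
To see what a two-color gradient does, we can look at the colors it produces at a few positions of the scale. The sketch below assumes that scale_color_gradient() builds its palette with seq_gradient_pal() from the scales package; the positions 0, 0.5 and 1 stand for the low end, the middle and the high end of the variable once rescaled.

```r
library(scales)

# Palette function that interpolates between the two ends
gradient_fun <- seq_gradient_pal(low = "#E1FA72", high = "#F46FEE")

# Colors at the low end, the middle and the high end of the scale
gradient_colors <- gradient_fun(c(0, 0.5, 1))
gradient_colors

# Visual check of the three colors
show_col(gradient_colors)
```

Every intermediate value of the variable receives a color interpolated between the two ends in this way.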

How to pick sensible colors for scale_color_gradient() or scale_fill_gradient()

You can use this website to select scales that have been found to work well.

  1. Open the website.
  2. Select the maximum possible number of data classes.
  3. Select sequential under nature of your data.
  4. Pick a color scheme you like.
  5. Specify the first color of that scheme as the low end of your gradient and the last color as the high end of your color gradient.

NOTE: if the first color makes the low end of your gradient too light, pick the second or third color. The same applies if the last color is too dark.

The screenshot below signals the important points to consider. An example with ggplot follows.

Use colorbrewer2 to define a sensible color gradient

Example:

p3 +
  scale_color_gradient(low = "#f7fcfd", high = "#00441b")
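
The same colors can also be retrieved from R instead of copying them from the website. The sketch below assumes that the scheme picked above corresponds to the 9-class "BuGn" palette of RColorBrewer, whose first and last colors appear to match the codes used in the example. The names bugn and p_bugn are mine.

```r
library(ggplot2)
library(RColorBrewer)

# The 9 colors of the sequential palette, from light to dark
bugn <- brewer.pal(n = 9, name = "BuGn")
bugn

p_bugn <- 
  ggplot(diamonds, aes(x = carat, y = price, color = price)) +
    geom_point() +
    # First color as the low end, last color as the high end
    scale_color_gradient(low = bugn[1], high = bugn[9])

p_bugn
```

If the low end turns out too light, bugn[2] or bugn[3] can be used instead, following the note above.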

Variables with a midpoint and two extremes (low-high): scale_color_gradient2() - scale_fill_gradient2()

These commands apply to continuous variables with a midpoint and two extremes. This is the continuous counterpart of the divergent discrete color palettes you may use in scale_color_brewer() (see that section earlier in this notebook).

How to pick sensible colors for a diverging two color gradient scale

You can use this website to select scales that have been found to work well.

  1. Open the website.
  2. Select diverging under nature of your data.
  3. Select the maximum possible number of data classes.
  4. Pick a color scheme you like.
  5. Specify the first color of that scheme as the low end of your gradient, the mid color as the midpoint of your gradient and the last color as the high end of your color gradient.

The screenshot below signals the important points to consider. An example with ggplot follows.

Select colors for diverging color gradients using colorbrewer2

NOTE: if either the low or the high end of your color gradient seems too intense, consider picking the previous step for both ends (move both the low and the high end so that the scale remains symmetric). For example:

Select colors for diverging color gradients using colorbrewer2

Example: correlation matrix heatmap

A classical example of a variable with two opposite ends and a relevant midpoint is the correlation coefficient. We will therefore repeat the correlation matrix heatmap example given in notebook 02, but this time we will explain in detail the possible arguments to scale_fill_gradient2().

The code to obtain aux_corr is not explained here with the same level of detail as in notebook 2. Refer to that notebook for a very detailed explanation.

1. Obtain the aux_corr dataframe. See notebook 02, heatmap example 2 for further details:
aux_corr <- 
  
  diamonds %>% 
  
    # Selects only variables of type "numeric" for which pearsons corr. coeff 
    # makes sense
    select(where(is.numeric)) %>%
    
    # Computes correlation matrix between the numerical variables
    cor() %>% 
  
    # Turn matrix into long format dataframe
    reshape2::melt() %>% 
  
    # Round values of the correlation coefficients
    mutate(value = round(value, 2))

aux_corr
    Var1  Var2 value
1  carat carat  1.00
2  depth carat  0.03
3  table carat  0.18
4  price carat  0.92
5      x carat  0.98
6      y carat  0.95
7      z carat  0.95
8  carat depth  0.03
9  depth depth  1.00
10 table depth -0.30
11 price depth -0.01
12     x depth -0.03
13     y depth -0.03
14     z depth  0.09
15 carat table  0.18
16 depth table -0.30
17 table table  1.00
18 price table  0.13
19     x table  0.20
20     y table  0.18
21     z table  0.15
22 carat price  0.92
23 depth price -0.01
24 table price  0.13
25 price price  1.00
26     x price  0.88
27     y price  0.87
28     z price  0.86
29 carat     x  0.98
30 depth     x -0.03
31 table     x  0.20
32 price     x  0.88
33     x     x  1.00
34     y     x  0.97
35     z     x  0.97
36 carat     y  0.95
37 depth     y -0.03
38 table     y  0.18
39 price     y  0.87
40     x     y  0.97
41     y     y  1.00
42     z     y  0.95
43 carat     z  0.95
44 depth     z  0.09
45 table     z  0.15
46 price     z  0.86
47     x     z  0.97
48     y     z  0.95
49     z     z  1.00
2. Generate the heatmap: here we will use scale_fill_gradient2() with specific arguments.
2.1 Store the graph with all the options except for the color in variable corr_heatmap.

We carry out this step to be able to try different color scales without the need to re-run all this code.

corr_heatmap <- 

  aux_corr %>% 
    
    # Create the canvas for the graph
    ggplot(aes(x = Var1, y = Var2, fill = value)) +
    
    # Create the tiles and adjust the area 
    # to ensure adequate spacing between them
    geom_tile(aes(width = 0.965, height = 0.95)) +
    
    # Reverse the y-axis so that the 1s of the matrix are on the main
    # diagonal of the matrix
    scale_y_discrete(limits=rev) +
    
    # Print the corr values within the tiles
    geom_text(aes(label=value)) +
    
    # Use white background as spacing within the tiles
    theme(panel.background=element_rect(fill='white')) + 
    
    # Remove the x-axis and y-axis labels
    ylab("") + xlab("") +
    
    # print the title
    ggtitle("correlation matrix - diamonds dataset") 

corr_heatmap

The default continuous color scale that ggplot applies to geom_tile() seems inadequate. It does not have two clear ends to signal the range from -1 to 1 that the correlation coefficient may take.

2.2 Use scale_fill_gradient2() to introduce an appropriate gradient scale.

We will use the gradient scale selected in the screenshot of the website colorbrewer2 included before in this same notebook.

The meaning of all the arguments is clearly explained in the code below:

corr_heatmap +

  # Define a gradient color scale with two ends
  scale_fill_gradient2(
                      # Specify low, mid and high colors
                      low = "#053061", mid = "#f7f7f7", high = "#67001f", 
                      
                      # Numeric value of the midpoint in our scale
                      midpoint = 0,
                      
                      # Numerical limits to map the color scale to.
                      # We pick -1 and 1, the possible range of values
                      # for a corr. coefficient.
                      limits = c(-1, 1), 
                      
                      # Title printed on top of the legend
                      name = "Corr. Coeff" 
                      )

To my eyes the high end of this color scale looks too dark. Therefore I will pick the previous steps for both the low and high ends as indicated in the screenshots of the website colorbrewer2 included previously in this notebook:

corr_heatmap +

  # Define a gradient color scale with two ends
  scale_fill_gradient2(
                      # Specify low, mid and high colors
                      low = "#2166ac", mid = "#f7f7f7", high = "#b2182b", 
                      
                      # Numeric value of the midpoint in our scale
                      midpoint = 0,
                      
                      # Numerical limits to map the color scale to.
                      # We pick -1 and 1, the possible ranges of values
                      # for a corr. coefficient.
                      limits = c(-1, 1), 
                      
                      # Title printed on top of the legend
                      name = "Corr. Coeff" 
                      )

This looks much nicer, at least in my estimation. Feel free to try your own scales!
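
As a compact variant of the example, the colors can be taken from RColorBrewer instead of being typed by hand. The sketch below assumes that the scheme selected on the website is the 11-class "RdBu" palette, whose codes appear to match the ones used above. It also builds the long-format correlation table with base R (as.table() and as.data.frame()) rather than reshape2::melt(). That is a swapped-in alternative, not the method of notebook 02, and rdbu, corr_long and p_corr are illustrative names.

```r
library(ggplot2)
library(RColorBrewer)

# 11 colors ordered from dark red (1) to dark blue (11); 6 is the midpoint
rdbu <- brewer.pal(n = 11, name = "RdBu")

# Correlation matrix of the numeric variables in long format
num_cols  <- vapply(diamonds, is.numeric, logical(1))
corr_long <- as.data.frame(as.table(cor(diamonds[, num_cols])))
names(corr_long) <- c("Var1", "Var2", "value")

p_corr <- 
  ggplot(corr_long, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    # One step in from each end keeps the scale symmetric
    scale_fill_gradient2(low = rdbu[10], mid = rdbu[6], high = rdbu[2],
                         midpoint = 0, limits = c(-1, 1),
                         name = "Corr. Coeff")

p_corr
```

Changing the indices (for example rdbu[11] and rdbu[1]) is then enough to try a darker or lighter pair of ends.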

More about colors and ggplot2

If you wish to further explore the topic of colors in ggplot, the following resources are good places to start. What we have seen in this notebook is more than enough for this course and to produce professional-looking graphs.

The package paletteer

We are not going to resort to it in this course, but it is good to know that the package paletteer exists, since it offers a large number of additional palettes you may use.

Paletteer package

Themes

Within ggplot you may specify a general theme among different standards to change the general appearance of the output produced. The default is theme_gray().

NOTE: within each theme, you may use the function theme() to make adjustments to the template as we have been doing (check the examples given in this and in other notebooks)

p1 <- ggplot(diamonds, aes(x = carat, y = price, color = price)) +
   geom_point()

p1

Examples

# The default, changes nothing
p1 + theme_gray()

# Removes the grey background and uses only bw elements for background and axes
p1 + theme_bw()

p1 + theme_classic()
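
The note above about combining a complete theme with theme() can be sketched as follows. The specific tweaks (a larger base font size and a legend at the bottom) are arbitrary choices of mine, and p_base and p_themed are illustrative names.

```r
library(ggplot2)

p_base <- 
  ggplot(diamonds, aes(x = carat, y = price, color = price)) +
    geom_point()

p_themed <- 
  p_base +
    # Complete theme first: it sets every element of the template
    theme_minimal(base_size = 14) +
    # Individual adjustments afterwards, so that they are not overwritten
    theme(legend.position = "bottom")

p_themed
```

The order matters: a complete theme added after theme() would replace the adjustments made before it.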

Complete list of themes

The complete list of themes is to be found here (or simply using autocomplete). The website includes an example for each. Below I simply list them:

  • theme_grey()
  • theme_bw()
  • theme_linedraw()
  • theme_light()
  • theme_dark()
  • theme_minimal()
  • theme_classic()
  • theme_void()

factor(): order of categorical variables in graphs.

Let us look again at the first example of a barplot we saw in notebook 1:

ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar()

We saw in the introductory session to the subject that, for these graphs, it is good practice to order the bars in descending or ascending order.

At this point, the variable color carries an order (its levels run from D to J) that does not match what we need in our graph.

To introduce the order information we need to turn this variable into a factor variable for which we specify the order. This can be attained with the function factor(), using it within mutate(). We are going to do this in 2 steps.

Step 1. Get the desired ordering for the categories

The first thing is to order the colors in the desired order. In this case from largest to smallest count. That is:

aux <- 
  diamonds %>% 
  
    # define color as the grouping variable
    group_by(color) %>% 
  
    # Count the number of elements in each color group
    summarize(
      count = n()
    ) %>% 
  
    # Arrange in descending order (from largest to smallest)
    arrange(desc(count))

aux
# A tibble: 7 × 2
  color count
  <ord> <int>
1 G     11292
2 E      9797
3 F      9542
4 H      8304
5 D      6775
6 I      5422
7 J      2808

The column color of this aux dataframe now contains the colors in the appropriate order. We want to extract this color column and store it as a vector. Check the notebook on the fundamental dplyr and tibble operations to check how to do this:

# Extract the column color and store it as a vector
colors_order <- 
  aux %>% pull(color)

Step 2. Redefine the variable using factor() and mutate()

diamonds <- 
  diamonds %>% 
    mutate(
      color = factor(color, # the original variable
                     levels = colors_order) # here we specify the order
    )

If we now use the exact same code, the colors will be printed in the appropriate order:

ggplot(diamonds, aes(x = color, fill = color)) +
  geom_bar()
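
For this particular case (ordering by frequency) there is also a shortcut. The sketch below uses fct_infreq() from the package forcats, which is loaded with the tidyverse. It is an alternative to the two steps above, not a replacement for understanding factor(). It refers to ggplot2::diamonds explicitly so that it starts from the original dataset rather than from the copy we just modified.

```r
library(ggplot2)
library(forcats)

# Levels of color reordered from most to least frequent
color_by_count <- fct_infreq(ggplot2::diamonds$color)
levels(color_by_count)

# The reordering can be done directly inside aes()
p_ordered <- 
  ggplot(ggplot2::diamonds, aes(x = fct_infreq(color), fill = color)) +
    geom_bar() +
    xlab("color")

p_ordered
```

The explicit two-step approach remains the one to use when the desired order is not given by the counts.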

Graph labels

Title

Option 1: ggtitle()

You may use ggtitle() to define a title for your graph

p1 + 
  ggtitle("price vs carat")

Option 2: labs()

Using this you may also specify a subtitle

p1 + 
  labs(
    title = "price vs carat",
    subtitle = "(Dataset > 50000 diamonds)",
  )

Axes labels

Option 1: xlab() and ylab()

You may use xlab() and ylab() to define the labels of the x and y axis

p1 + 
  ggtitle("price vs carat") +
  xlab("x = carat") + 
  ylab("y = price")

Option 2: labs()

p1 + 
  labs(
    title = "price vs carat",
    subtitle = "(Dataset > 50000 diamonds)",
    x = "x = carat",
    y = "y = price"
  )

Axis text. Size, rotation and color

You can feed the following arguments to the theme() function to rotate and change the size and color of the axes labels:

p1 + 
  theme(
        # Change size, color and rotate x-axis labels
        axis.text.x=element_text(size=20, color='green', angle = 90),
        
        # y-axis
        axis.text.y=element_text(size=15, color='red')
       )

labs() for other graph labels

You may use the function labs() to include many more annotations on your figure.

The labels in the example below have been included for you to see the possibilities. It does not mean that it is best practice to include all of them always.

p1 + 
  labs(
    title = "price vs carat",
    subtitle = "(Dataset > 50000 diamonds)",
    x = "x = carat",
    y = "y = price",
    caption = "Data from ggplot's Diamonds dataset", # Includes caption
    tag = "Figure 1", # includes tag on the figure
    colour = "price of\nthe diamond" # adapts the title of the legend
                                     # \n is the newline character
  )

Saving ggplot figures

In this section we briefly explain how to save a ggplot image.

Imagine you have created a plot and stored it as p1 or some other variable name. For example:

p1 <- ggplot(diamonds, aes(x = carat, y = price, color = price)) +
   geom_point()

p1

Option 1 - have a window pop-up asking for file location

We have stored the previous graph in the variable p1. We may store this graph as an image file in our computer with the code below.

# Save the ggplot object to the chosen location
ggsave(filename = file.choose(new = TRUE), 
       plot = p1)

This will open a window so that you can select the folder where the file is to be stored. Once you are at the desired location, type a filename and an appropriate image extension (e.g. .png). In my case I am going to specify the filename fig1.png

Then click on open and the figure will be saved.

You may get a prompt saying that the file does not exist and asking you if you want to create it. In that case, answer yes.

Option 2 - setting up the working directory and filename

Step 1: Setting the working directory

The file will be saved in your current working directory. Hence the importance of setting the current working directory.

This link is a video I created for you on how to set up the working directory. It is part of a playlist on the fundamentals of RStudio.

Once the working directory has been set, you can check it has been properly changed with the following command

# RETURNS THE WORKING DIRECTORY
getwd()

Step 2: Save the file

Once you have set up the working directory, you can save the file with the following command

filename <- "image_name.png"

# Save the ggplot object to the chosen location
ggsave(filename = filename, 
       plot = p1)

You can change the extension (e.g. .jpg, .svg) to get the output image in other formats. Saving as .svg requires the package svglite to be installed.

The resulting file will be saved in the working directory you set up.

Changing the size of the saved image

ggplot adapts the size of the graph automatically to the output device. This means you can specify a larger height and width than the default ones if for some reason your visualization requires this.

To attain this, specify the arguments height, width and units when saving the figure. For example, the code below sets height = 7 inches and width = 7 inches:

# Save the ggplot object to the chosen location
ggsave(filename = file.choose(new = TRUE), 
       plot = p1, 
       width = 7,
       height = 7, 
       units = "in")
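
If you prefer to avoid the pop-up window, the same arguments work with an explicit path. The sketch below writes the file to the temporary folder of the R session. The path, the file name and the dpi value are assumptions of mine for illustration, so replace them with your own.

```r
library(ggplot2)

p_save <- 
  ggplot(diamonds, aes(x = carat, y = price, color = price)) +
    geom_point()

# Hypothetical output path inside the session's temporary folder
out_file <- file.path(tempdir(), "fig1.png")

# Save a 7 x 7 inch image at 300 dots per inch
ggsave(filename = out_file, 
       plot = p_save, 
       width = 7,
       height = 7, 
       units = "in",
       dpi = 300)

# Check that the file has been created
file.exists(out_file)
```

With width, height and dpi fixed in the code, the saved figure has the same size every time the script is run.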

Important final note

ggplot offers many more options for graph customization. There is no way we are going to be able to cover all of it in class.

This is just an introduction on how to use some of these options.

With this strong foundation you may now use your best friends Google or ChatGPT and adapt the code they return to you.

Alternatively, you may also explore this excellent book.